############################ Before Starting ############################
# It is advisable to use the nbextension Table of Contents to better   #
# navigate through the notebook, with the maximum level of nested      #
# sections displayed in the table of contents.                         #
#########################################################################
Author: Alessandro Arnone
Akadelivers is a home-delivery company specialised in delivering packages in under an hour, a model known as Q-commerce (Quick commerce). The company has a mobile app through which its users can choose from a catalogue of products from local shops in their city and have them delivered in under 10 minutes to whatever address they want.
When a user places an order through Akadelivers, the total cost (product cost + service fee + delivery fee) is charged directly. Once the user has paid for a product, the courier closest to the store that stocks it goes to the store, pays for the product, picks it up, and takes it to the address the user indicated.

order_id: Order identification number.
local_time: Local time at which the order is placed.
country_code: Code of the country in which the order is placed.
store_address: ID of the store at which the order is placed.
payment_status: Payment status of the order.
n_of_products: Number of products purchased in the order.
products_total: Amount in euros the user has spent in the app.
final_status: Final status of the order (this is the 'target' variable to predict), indicating whether the order is ultimately delivered or cancelled. There are two possible statuses: DeliveredStatus and CanceledStatus.
In this section I want to give, up front, an overview of the results and the methodology used.
To carry out the assignment, the following assumptions have been made:
- there is no 'minimum order' for a transaction (products_total variable)
- a cancelled payment does not mean a cancelled delivery (payment_status variable)

Variable selection has been made based on the relationship between each predictor and the final status. Results are collected in each section.
import pandas as pd
import numpy as np
from datetime import datetime
# stats
from scipy.stats import chi2_contingency
# sklearn
from sklearn.model_selection import (cross_val_score, cross_val_predict,
                                     RepeatedStratifiedKFold, KFold, StratifiedKFold,
                                     train_test_split, GridSearchCV, RandomizedSearchCV)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, balanced_accuracy_score
# feature importance
import dalex as dx
# imblearn
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, ADASYN
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)
# formatting
from IPython.display import display
pd.options.display.float_format = '{:,.2f}'.format
URL_TRAIN='https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/jump2digital/dataset/train.csv'
URL_TEST='https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/jump2digital/dataset/test_X.csv'
trainSet=pd.read_csv(URL_TRAIN)
Before counting the distinct orders, we need to verify that there are no duplicates among the 54330 observations.
trainSet.nunique()
order_id          54330
local_time        32905
country_code         23
store_address      5627
payment_status        3
n_of_products        27
products_total     3904
final_status          2
dtype: int64
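Since order_id should be the primary key, the same check can be made explicit with `duplicated()`. A minimal sketch on a toy frame (the real data is loaded remotely, so `df` here is an illustrative stand-in):

```python
import pandas as pd

# toy frame standing in for trainSet; the real data has 54330 rows
df = pd.DataFrame({"order_id": [1, 2, 3, 3],
                   "country_code": ["ES", "ES", "AR", "AR"]})

# duplicated() marks repeated order_id values; any() flags their presence
has_dups = df["order_id"].duplicated().any()
print(has_dups)  # True for this toy frame; False on the real trainSet
```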
Once that is verified, we can count the order_id values grouped by country.
top3Country=trainSet.groupby('country_code')['order_id'].count().sort_values(ascending=False)[0:3].reset_index().rename(columns = {'order_id':'total_orders'})
The top 3 countries by order are: Argentina, Spain and Turkey
top3Country
| | country_code | total_orders |
|---|---|---|
| 0 | AR | 11854 |
| 1 | ES | 11554 |
| 2 | TR | 5696 |
trainSet['local_time'] = pd.to_datetime(trainSet['local_time'])
trainSet['hour'] = trainSet['local_time'].dt.hour  # local_time is already a datetime
ordersByHour=trainSet[trainSet['country_code']=='ES'].groupby('hour')['order_id'].count().sort_values(ascending=False)
hourSpain=trainSet[trainSet['country_code']=='ES']['hour']
plt.hist(hourSpain,bins=24)
plt.xlabel('Hour')
plt.ylabel('Count of Orders')
plt.title('Histogram of Order by hour')
plt.xlim(0, 24)
plt.grid(True)
plt.show()
The busiest hours cluster around dinner time (19:00-21:59), with the timespan from 20:00:00 to 20:59:59 being the busiest.
ordersByHour
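The busiest hour can also be read off programmatically with `idxmax()`. A sketch with a hypothetical counts Series (the values below are illustrative, not the real Spanish counts):

```python
import pandas as pd

# hypothetical hourly order counts for Spain (stand-in for ordersByHour)
ordersByHour = pd.Series({12: 800, 19: 1200, 20: 1500, 21: 1300})

# idxmax() returns the index label (the hour) with the largest count
busiest = ordersByHour.idxmax()
print(busiest)  # 20
```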
averageShop12513_complete=trainSet[trainSet['store_address']==12513]['products_total'].mean()
averageShop12513_onlyDelivered=trainSet[(trainSet['store_address']==12513) & (trainSet['final_status']=='DeliveredStatus')]['products_total'].mean()
print('The average order price for store 12513 is:', round(averageShop12513_complete,2), '[Considering all the orders]')
print('The average order price for store 12513 is:', round(averageShop12513_onlyDelivered,2), '[Considering only delivered orders]')
The average order price for store 12513 is: 17.39 [Considering all the orders]
The average order price for store 12513 is: 17.38 [Considering only delivered orders]
Taking the demand peaks in Spain into account, and assuming couriers work 8-hour shifts:
what percentage of couriers would you assign to each shift so they can cope with the demand peaks? (e.g. Shift 1: 30%, Shift 2: 10%, Shift 3: 60%)
bins = [0, 7, 15, 24]
# add custom labels if desired
labels = ['00:00-07:59', '08:00-15:59', '16:00-23:59']
# add the bins to the dataframe
trainSet['time_bin'] = pd.cut(trainSet['hour'], bins, labels=labels, right=False)
orderByBinnedHour=trainSet[(trainSet['final_status']=='DeliveredStatus') & (trainSet['country_code']=='ES')].groupby('time_bin')['order_id'].count().rename("percentage").transform(lambda x: (x/x.sum()))
orderByBinnedHour
time_bin
00:00-07:59   0.00
08:00-15:59   0.34
16:00-23:59   0.66
Name: percentage, dtype: float64
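Mapping those shares directly onto the three 8-hour shifts gives a staffing split. A sketch using the percentages from the output above (the `staffing` name is illustrative):

```python
import pandas as pd

# delivered-order share per 8-hour shift in Spain, from the output above
orderByBinnedHour = pd.Series(
    {"00:00-07:59": 0.00, "08:00-15:59": 0.34, "16:00-23:59": 0.66}
)

# couriers per shift, expressed as a percentage of the workforce
staffing = (orderByBinnedHour * 100).round().astype(int)
print(staffing.to_dict())  # {'00:00-07:59': 0, '08:00-15:59': 34, '16:00-23:59': 66}
```

In practice a small night crew (rather than literally 0%) would be kept to cover residual demand.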
# 1 = DeliveredStatus, 0 = CanceledStatus (drop_first drops the CanceledStatus column)
trainSet['final_status_binary'] = pd.get_dummies(trainSet['final_status'], drop_first=True)
trainSet['products_total'].hist()
<AxesSubplot:>
ax=sns.countplot(x="payment_status", hue="final_status", data=trainSet)
total = len(trainSet)
for p in ax.patches:
percentage = f'{100 * p.get_height() / total:.1f}%\n'
x = p.get_x() + p.get_width() / 2
y = p.get_height()
ax.annotate(percentage, (x, y), ha='center', va='center')
plt.tight_layout()
plt.show()
trainSet.groupby(['payment_status','final_status'])['order_id'].count().rename("percentage").transform(lambda x: x/x.sum())
payment_status final_status
DELAYED CanceledStatus 0.00
DeliveredStatus 0.00
NOT_PAID CanceledStatus 0.00
DeliveredStatus 0.01
PAID CanceledStatus 0.11
DeliveredStatus 0.89
Name: percentage, dtype: float64
Based on that, it appears that when the status of a transaction is NOT_PAID, the probability of cancellation increases. Hence it will be included in our model.
To assess this, a chi-square test will be performed (categorical vs categorical). If H(0) = independence can be rejected, the two variables will be considered dependent, hence part of the variability in the final status can be explained by the variability in the variable payment_status.
chi2, p, dof, expected = chi2_contingency((pd.crosstab(trainSet.payment_status, trainSet.final_status).values))
print (f'Chi-square Statistic : {chi2} ,p-value: {p}')
Chi-square Statistic : 102.42785141954717 ,p-value: 5.728945194717638e-23
We can reject the null hypothesis and conclude there is a relationship between payment_status and final_status
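A significant chi-square says the variables are associated, but not how strongly. Cramér's V (an addition, not part of the original analysis) normalises the statistic to [0, 1]. A sketch on a toy 2x2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# toy 2x2 contingency table (payment_status x final_status)
table = np.array([[90, 10],
                  [60, 40]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: chi2 scaled by sample size and table shape
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))  # 0 = independent, 1 = perfect association
print(cramers_v)
```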
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(trainSet['n_of_products'],0)
ax1.set_ylabel("Number of product")
ax1.set_title("Product number - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Quantity of products")
ax2.hist(trainSet['n_of_products'], color='darkslateblue',bins=40, alpha=0.9)
ax2.set_title(" Product number distribution of all transactions", fontsize=12)
plt.show()
Outliers check:
- Median centered around 2
- 62% of the transactions have 1 or 2 products
- 89% of the transactions have 5 or fewer products
- Distribution positively skewed
- Extreme values are expected given the right-skewed distribution, hence they will not be removed
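Shares like the 62% and 89% above come from the cumulative distribution of n_of_products. A sketch of that computation on a toy column (values are illustrative, not the real data):

```python
import pandas as pd

# toy n_of_products column
n_of_products = pd.Series([1, 1, 2, 2, 2, 3, 4, 5, 6, 14])

# per-value share, then cumulative share up to each product count
share = n_of_products.value_counts(normalize=True).sort_index()
cumulative = share.cumsum()

print(cumulative.loc[2])  # share of orders with <= 2 products
print(cumulative.loc[5])  # share of orders with <= 5 products
```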
groupinProduct=trainSet.groupby('n_of_products').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['n_of_products']=groupinProduct['n_of_products']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('n_of_products').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
fig = plt.figure()
ax = plt.axes()
ax.plot(totalTransaction['n_of_products'], totalTransaction['percentageDelivered'])
plt.title('Percentage of Delivered by Product number')
Text(0.5, 1.0, 'Percentage of Delivered by Product number')
sns.countplot(x='n_of_products', hue='final_status_binary', data=trainSet)
<AxesSubplot:xlabel='n_of_products', ylabel='count'>
from scipy import stats
# point-biserial correlation; output is a tuple (correlation, p-value)
result = stats.pointbiserialr(trainSet['n_of_products'], trainSet['final_status_binary'])
print(f'correlation between X and y: {result[0]:.2f}')
print(f'p-value: {result[1]:.2g}')
correlation between X and y: 0.02
p-value: 2.3e-05
The percentage of delivered orders is constant up to 13 products (which account for 99%+ of the data), hence the product number does not look usable in our model, since it does not explain any variability of the target variable. This is confirmed by the point-biserial test, used to measure the association between the number of products and the target: the correlation is essentially 0, with a reliable p-value.
ax1=sns.distplot(trainSet[trainSet['final_status']=='DeliveredStatus']['hour'], label='Delivered')
ax2=sns.distplot(trainSet[trainSet['final_status']=='CanceledStatus']['hour'], label='Cancelled')
plt.xlabel('Hour', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
bin_orders = [0, 6, 12, 18,24]
labels = ['Night_orders', 'Morning_orders', 'Afternoon_orders','Evening_orders']
trainSet['bin_orders'] = pd.cut(trainSet['hour'], bin_orders, labels=labels, right=False)
groupinProduct=trainSet.groupby('bin_orders').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['bin_orders']=groupinProduct['bin_orders']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('bin_orders').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction
| bin_orders | totalTransaction | numberDelivered | percentageDelivered | |
|---|---|---|---|---|
| 0 | Night_orders | 631 | 424.00 | 0.67 |
| 1 | Morning_orders | 6671 | 6,035.00 | 0.90 |
| 2 | Afternoon_orders | 21338 | 19,236.00 | 0.90 |
| 3 | Evening_orders | 25690 | 22,803.00 | 0.89 |
We can see above that night orders (defined as orders placed from midnight until 6am) behave differently from the rest: the probability that they are cancelled is much higher. Based on this we will include the hour of the order in our model.
trainSet['products_total'].describe()
count   54,330.00
mean         9.84
std          9.26
min          0.00
25%          4.13
50%          7.13
75%         12.77
max        221.48
Name: products_total, dtype: float64
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(trainSet['products_total'], 0)
ax1.set_ylabel("Products total (EUR)")
ax1.set_title("Products total - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Products total (EUR)")
ax2.hist(trainSet['products_total'], color='darkslateblue', bins=500, alpha=0.9)
ax2.set_title("Products total distribution of all transactions", fontsize=12)
plt.show()
fig, (ax1, ax2) = plt.subplots(1,2)
fig.suptitle('Focus on products_total < 1')
ax1.boxplot(trainSet[trainSet['products_total']<1]['products_total'], 0)
ax1.set_ylabel("Products total (EUR)")
ax1.set_title("Products total - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Products total (EUR)")
ax2.hist(trainSet[trainSet['products_total']<1]['products_total'], alpha=0.9)
ax2.set_title("Products total distribution for transactions under 1 EUR", fontsize=12)
plt.show()
The probability density plots of the two classes of our dependent variable almost overlap, except in the tail.
ax1=sns.distplot(trainSet[trainSet['final_status']=='DeliveredStatus']['products_total'], bins=200,label='Delivered')
ax2=sns.distplot(trainSet[trainSet['final_status']=='CanceledStatus']['products_total'],bins=200, label='Cancelled')
plt.xlabel('Density Plot for Product Price', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
ax1=sns.distplot(trainSet[(trainSet['final_status']=='DeliveredStatus') & (trainSet['products_total']>50)]['products_total'], bins=20 ,label='Delivered')
ax2=sns.distplot(trainSet[(trainSet['final_status']=='CanceledStatus') & (trainSet['products_total']>50)]['products_total'], bins=20,label='Cancelled')
plt.xlabel('Density Plot for Product Price', fontsize=15)
plt.ylabel('Probability Density', fontsize=15)
plt.legend(fontsize=15)
plt.show()
bin_products = [0, 50, 100, 260]
labels = ['Total_Amount < 50', '50 <= Total_Amount < 100', 'Total_Amount >= 100']
trainSet['bin_products'] = pd.cut(trainSet['products_total'], bin_products, labels=labels, right=False)
groupinProduct=trainSet.groupby('bin_products').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['bin_products']=groupinProduct['bin_products']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('bin_products').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction
| | bin_products | totalTransaction | numberDelivered | percentageDelivered |
|---|---|---|---|---|
| 0 | Total_Amount < 50 | 53987 | 48,232.00 | 0.89 |
| 1 | 50 <= Total_Amount < 100 | 321 | 253.00 | 0.79 |
| 2 | Total_Amount >= 100 | 22 | 13.00 | 0.59 |
result = stats.pointbiserialr(trainSet['products_total'], trainSet['final_status_binary'])
print('Product_total CONTINUOUS:')
print(f'correlation between X and y: {result[0]:.2f}')
print(f'p-value: {result[1]:.2g}')
# chi-square on the binned amounts (categorical vs categorical)
chi2, p, dof, expected = chi2_contingency(pd.crosstab(trainSet.bin_products, trainSet.final_status).values)
print('\nProduct_total BINNED:')
print(f'Chi-square Statistic : {chi2} ,p-value: {p}')
Product_total CONTINUOUS:
correlation between X and y: -0.02
p-value: 3.1e-06

Product_total BINNED:
Chi-square Statistic : 4806.578714548748 ,p-value: 7.333811402755825e-22
It seems that the larger the total amount of the transaction, the lower the probability that the order is completed: an order is much more likely to be delivered when the total amount is under 50 than when it is over 100. Unfortunately, the transactions following the latter pattern do not represent a numerous sample in our dataset.
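The comparison above can be computed directly as a delivery rate per amount bin, since the mean of a 0/1 flag is a rate. A sketch on toy data (the frame and values are illustrative, not the real dataset):

```python
import pandas as pd

# toy data: order amount and binary delivered flag
df = pd.DataFrame({
    "products_total": [10, 30, 55, 70, 120, 150],
    "delivered":      [1,  1,  1,  1,  0,   1],
})

# same bin edges as used for trainSet above
bins = [0, 50, 100, 260]
labels = ["<50", "50-100", ">100"]
df["amount_bin"] = pd.cut(df["products_total"], bins, labels=labels, right=False)

# mean of a 0/1 flag per bin = delivery rate per bin
rate = df.groupby("amount_bin")["delivered"].mean()
print(rate.to_dict())
```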
sns.pointplot(x='country_code',y='products_total', hue='final_status_binary',data=trainSet)
<AxesSubplot:xlabel='country_code', ylabel='products_total'>
sns.countplot(x='country_code',hue='final_status_binary',data=trainSet, order = trainSet['country_code'].value_counts().index)
plt.title("Count of Cancelled vs delivered transaction")
Text(0.5, 1.0, 'Count of Cancelled vs delivered transaction')
sns.boxplot(x='country_code',y='products_total',hue='final_status_binary', data=trainSet)
plt.title("products_total by country and final status")
Text(0.5, 1.0, 'products_total by country and final status')
groupinProduct=trainSet.groupby('country_code').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['country_code']=groupinProduct['country_code']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('country_code').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction.sort_values(by='percentageDelivered',ascending=False)
| | country_code | totalTransaction | numberDelivered | percentageDelivered |
|---|---|---|---|---|
| 4 | CR | 1000 | 926.00 | 0.93 |
| 11 | GT | 511 | 468.00 | 0.92 |
| 9 | FR | 1911 | 1,754.00 | 0.92 |
| 19 | RO | 1957 | 1,799.00 | 0.92 |
| 13 | KE | 84 | 77.00 | 0.92 |
| 8 | ES | 11554 | 10,634.00 | 0.92 |
| 16 | PE | 4284 | 3,923.00 | 0.92 |
| 5 | DO | 448 | 409.00 | 0.91 |
| 20 | TR | 5696 | 5,180.00 | 0.91 |
| 6 | EC | 2265 | 2,031.00 | 0.90 |
| 12 | IT | 2537 | 2,276.00 | 0.90 |
| 21 | UA | 3729 | 3,330.00 | 0.89 |
| 15 | PA | 909 | 806.00 | 0.89 |
| 7 | EG | 1643 | 1,447.00 | 0.88 |
| 17 | PR | 29 | 25.00 | 0.86 |
| 10 | GE | 485 | 415.00 | 0.86 |
| 3 | CL | 994 | 857.00 | 0.86 |
| 0 | AR | 11854 | 10,107.00 | 0.85 |
| 14 | MA | 1446 | 1,222.00 | 0.85 |
| 18 | PT | 818 | 684.00 | 0.84 |
| 22 | UY | 169 | 125.00 | 0.74 |
| 2 | CI | 6 | 3.00 | 0.50 |
| 1 | BR | 1 | 0.00 | 0.00 |
The stores with the most transactions have different probabilities of their orders being marked as delivered. Example:
Together they account for around 40% of the dataset, hence they can potentially contribute to explaining the variability of our dependent variable.
Introducing a variable like store_address with huge cardinality (more than 5500 unique values) raises several issues:
Since it seems that above a certain number of transactions the probability of delivery increases, I will create a new variable indicating whether the store_address has fewer than 25 transactions or 25 or more. (This threshold was chosen because beyond 25 transactions the delivery probability ranges from 0.8 to 1, whereas below it ranges from 0 to 1.)
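The thresholded store feature described above boils down to an `isin` lookup against the set of low-count stores. A minimal sketch on toy data (the names `counts`, `small_stores`, and `smallStoreFlag` are illustrative; the 25-transaction cutoff is the one chosen above):

```python
import pandas as pd

# toy per-store transaction counts (stand-in for the totalTransaction table)
counts = pd.DataFrame({"store_address": [100, 101, 102],
                       "totalTransaction": [5, 30, 80]})

# stores under the 25-transaction cutoff
small_stores = counts.loc[counts["totalTransaction"] < 25, "store_address"]

# toy orders table: flag each order by whether its store is a small store
orders = pd.DataFrame({"store_address": [100, 101, 102, 100]})
orders["smallStoreFlag"] = orders["store_address"].isin(small_stores).astype(int)
print(orders["smallStoreFlag"].tolist())  # [1, 0, 0, 1]
```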
groupinProduct=trainSet.groupby('store_address').count().reset_index()
totalTransaction=pd.DataFrame()
totalTransaction['store_address']=groupinProduct['store_address']
totalTransaction['totalTransaction']=groupinProduct['final_status_binary']
groupinProduct=trainSet.groupby('store_address').sum().reset_index()
totalTransaction['numberDelivered']=groupinProduct['final_status_binary']
totalTransaction['percentageDelivered']=round(totalTransaction['numberDelivered']/totalTransaction['totalTransaction'],2)
totalTransaction.sort_values(by='numberDelivered', ascending=False)
| | store_address | totalTransaction | numberDelivered | percentageDelivered |
|---|---|---|---|---|
| 1350 | 28671 | 455 | 433.00 | 0.95 |
| 685 | 12513 | 245 | 239.00 | 0.98 |
| 775 | 14455 | 227 | 215.00 | 0.95 |
| 1356 | 28712 | 221 | 209.00 | 0.95 |
| 1351 | 28675 | 228 | 199.00 | 0.87 |
| ... | ... | ... | ... | ... |
| 3870 | 62760 | 2 | 0.00 | 0.00 |
| 2674 | 51139 | 1 | 0.00 | 0.00 |
| 3864 | 62698 | 1 | 0.00 | 0.00 |
| 4780 | 68973 | 1 | 0.00 | 0.00 |
| 3137 | 56000 | 4 | 0.00 | 0.00 |
5627 rows × 4 columns
test=totalTransaction
plt.scatter(test['totalTransaction'],test['percentageDelivered'])
<matplotlib.collections.PathCollection at 0x123db49d0>
test=test[test['totalTransaction']<60]
plt.scatter(test['totalTransaction'],test['percentageDelivered'])
<matplotlib.collections.PathCollection at 0x124e23760>
The higher the number of transactions per shop, the higher the probability that an order will be delivered; this trend holds for shops with more than 60 transactions.
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.boxplot(test['numberDelivered'], 0)
ax1.set_ylabel("Delivered orders per store")
ax1.set_title("Delivered orders per store - whisker plot", fontsize=12)
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Delivered orders per store")
ax2.hist(test['numberDelivered'], color='darkslateblue', bins=500, alpha=0.9)
ax2.set_title("Delivered orders per store - distribution", fontsize=12)
plt.show()
trainSet['highTransaction']=0
store_address_small=totalTransaction[(totalTransaction['totalTransaction']<25) ]['store_address']
store_address_small
0 190
1 191
2 193
3 194
4 196
...
5622 74863
5623 74871
5624 74873
5625 74889
5626 75236
Name: store_address, Length: 5097, dtype: int64
index_store_address_small=trainSet[trainSet.store_address.isin(store_address_small)].index
# flag orders whose store has fewer than 25 transactions
trainSet.loc[index_store_address_small, 'highTransaction'] = 1
sns.countplot(x='highTransaction',hue='final_status',data=trainSet)
plt.title("Count of Cancelled vs delivered transaction")
Text(0.5, 1.0, 'Count of Cancelled vs delivered transaction')
X=trainSet.copy()
y=trainSet['final_status_binary']
country_binned=pd.get_dummies(trainSet['country_code'])
payment_binned=pd.get_dummies(trainSet['payment_status'])
X=pd.concat([X, country_binned], axis=1)
X=pd.concat([X, payment_binned], axis=1)
X.columns
Index(['order_id', 'local_time', 'country_code', 'store_address',
'payment_status', 'n_of_products', 'products_total', 'final_status',
'hour', 'time_bin', 'final_status_binary', 'bin_orders', 'bin_products',
'highTransaction', 'AR', 'BR', 'CI', 'CL', 'CR', 'DO', 'EC', 'EG', 'ES',
'FR', 'GE', 'GT', 'IT', 'KE', 'MA', 'PA', 'PE', 'PR', 'PT', 'RO', 'TR',
'UA', 'UY', 'DELAYED', 'NOT_PAID', 'PAID'],
dtype='object')
X.drop(['order_id', 'local_time', 'country_code',
'payment_status', 'n_of_products', 'final_status','time_bin', 'final_status_binary', 'bin_orders',
'bin_products','store_address'], axis=1, inplace=True)
X.columns
Index(['products_total', 'hour', 'highTransaction', 'AR', 'BR', 'CI', 'CL',
'CR', 'DO', 'EC', 'EG', 'ES', 'FR', 'GE', 'GT', 'IT', 'KE', 'MA', 'PA',
'PE', 'PR', 'PT', 'RO', 'TR', 'UA', 'UY', 'DELAYED', 'NOT_PAID',
'PAID'],
dtype='object')
clf = RandomForestClassifier()
clf.fit(X, y)
DT_Dummy = dx.Explainer(clf, X, y,
label = "Dummy Model - Random Forest")
mp_rf = DT_Dummy.model_parts()
mp_rf.result
mp_rf.plot()
Preparation of a new explainer is initiated
  -> data              : 54330 rows 29 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 54330 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Dummy Model - Random Forest
  -> predict function  : <function yhat_proba_default at 0x121251670> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.892, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.949, mean = 0.000287, max = 0.745
  -> model_info        : package sklearn
A new explainer has been created!
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 42)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42)
(43464, 29) (10866, 29) (43464,) (10866,)
# define model
model = RandomForestClassifier()
# evaluate with cross-validation on the training set
scores = cross_val_score(model, X_train, Y_train, scoring='f1', cv=cv, n_jobs=-1)
# fit on the training set and predict the held-out test set
model.fit(X_train, Y_train)
preds = model.predict(X_test)
print(scores)
print(classification_report(Y_test, preds))
[0.91165629 0.91479934 0.91734075 0.91627612 0.9118657 0.91244591
0.91363983 0.91592808 0.9154002 0.91472081]
precision recall f1-score support
0 0.20 0.12 0.15 1166
1 0.90 0.94 0.92 9700
accuracy 0.85 10866
macro avg 0.55 0.53 0.53 10866
weighted avg 0.82 0.85 0.84 10866
pipeline = imbpipeline(steps=[('smote', SMOTE(random_state=11)),
                              ('classifier', LogisticRegression(random_state=11,
                                                                max_iter=1000))])
param_grid = {'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(estimator=pipeline,
param_grid=param_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
Cross-validation score: 0.8268647968793952
Test score: 0.827284946236559
precision recall f1-score support
0 0.15 0.34 0.20 1166
1 0.91 0.76 0.83 9700
accuracy 0.72 10866
macro avg 0.53 0.55 0.52 10866
weighted avg 0.82 0.72 0.76 10866
pipeline = imbpipeline(steps=[('smote', SMOTE(random_state=11)),
                              ('classifier', RandomForestClassifier(verbose=2))])
# Number of trees in random forest
n_estimators = [10,20]
# Maximum number of levels in tree
max_depth = [10,20]
# Create the random grid
random_grid = {'classifier__n_estimators': n_estimators,
'classifier__max_depth': max_depth
}
grid_search = RandomizedSearchCV(estimator=pipeline,
param_distributions = random_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
/Users/alessandro/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_search.py:285: UserWarning: The total space of parameters 4 is smaller than n_iter=10. Running 4 iterations. For exhaustive searches, use GridSearchCV.
Cross-validation score: 0.8254049707665005
Test score: 0.8141251277392983
precision recall f1-score support
0 0.14 0.36 0.20 1166
1 0.91 0.74 0.81 9700
accuracy 0.70 10866
macro avg 0.52 0.55 0.51 10866
weighted avg 0.82 0.70 0.75 10866
pipeline = imbpipeline(steps=[('smote', SMOTE()),
                              ('under', RandomUnderSampler()),
                              ('classifier', RandomForestClassifier())])
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=5)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 50, num=11)]
# Create the random grid
random_grid = {'classifier__n_estimators': n_estimators,
               'classifier__max_depth': max_depth,
               'classifier__max_features': max_features
               }
grid_search = RandomizedSearchCV(estimator=pipeline,
param_distributions = random_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_search.fit(X_train,Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
pipeline = imbpipeline(steps=[('smote', SMOTE()),
                              ('under', RandomUnderSampler()),
                              ('classifier', AdaBoostClassifier())])
n_estimators = [500]
# Create the grid (only n_estimators is tuned here)
random_grid = {'classifier__n_estimators': n_estimators}
grid_search = GridSearchCV(estimator=pipeline,
param_grid= random_grid,
scoring='f1',
cv=cv,
n_jobs=-1, verbose=3)
grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
pipeline = imbpipeline(steps=[('smote', SMOTE()),
                              ('under', RandomUnderSampler()),
                              ('classifier', GradientBoostingClassifier())])
n_estimators = [100, 500, 700, 1000]
# Create the grid
fixed_grid = {'classifier__n_estimators': n_estimators}
grid_search = GridSearchCV(estimator=pipeline,
param_grid= fixed_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_result=grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
Cross-validation score: 0.8437741347943433
Test score: 0.8386736630568271
precision recall f1-score support
0 0.17 0.40 0.24 1166
1 0.91 0.77 0.84 9700
accuracy 0.73 10866
macro avg 0.54 0.59 0.54 10866
weighted avg 0.84 0.73 0.77 10866
plt.errorbar(n_estimators, grid_result.cv_results_['mean_test_score'], yerr=grid_result.cv_results_['std_test_score'])
plt.title("GradientBoosting n_estimators vs F1")
plt.xlabel('n_estimators')
plt.ylabel('F1')
plt.savefig('n_estimators.png')
To avoid overfitting, we cap the number of estimators at 500.
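An alternative way to check how many estimators are worth keeping, without re-fitting the model once per grid value, is `GradientBoostingClassifier.staged_predict`, which yields predictions after each boosting stage of a single fit. A sketch on synthetic imbalanced data (the dataset and variable names here are illustrative, not the Akadelivers data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced binary problem (~85% majority class)
X, y = make_classification(n_samples=500, weights=[0.85], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

gb = GradientBoostingClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# validation F1 after each boosting stage, from one fitted model
scores = [f1_score(y_val, pred) for pred in gb.staged_predict(X_val)]
best_stage = int(np.argmax(scores)) + 1
print(best_stage)
```

Plotting `scores` against the stage index shows where validation F1 plateaus, which motivates cutting n_estimators off rather than growing it further.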
pipeline = imbpipeline(steps=[('smote', SMOTE()),
                              ('under', RandomUnderSampler()),
                              ('classifier', GradientBoostingClassifier())])
n_estimators = [500]
# Create the grid
fixed_grid = {'classifier__n_estimators': n_estimators}
grid_search = GridSearchCV(estimator=pipeline,
param_grid= fixed_grid,
scoring='f1',
cv=cv,
n_jobs=-1)
grid_result=grid_search.fit(X_train, Y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(X_test, Y_test)
print(f'Cross-validation score: {cv_score}\n Test score: {test_score}')
print(classification_report(Y_test, grid_search.predict(X_test)))
Cross-validation score: 0.843093288984176
Test score: 0.840020088164723
precision recall f1-score support
0 0.18 0.40 0.25 1166
1 0.92 0.78 0.84 9700
accuracy 0.74 10866
macro avg 0.55 0.59 0.54 10866
weighted avg 0.84 0.74 0.78 10866
testSet = pd.read_csv(URL_TEST, sep=';')
testSet['local_time'] = pd.to_datetime(testSet['local_time'])
testSet['hour']= pd.to_datetime(testSet['local_time'], format='%H:%M:%S').dt.hour
country_binned=pd.get_dummies(testSet['country_code'])
payment_binned=pd.get_dummies(testSet['payment_status'])
testSet=pd.concat([testSet, country_binned], axis=1)
testSet=pd.concat([testSet, payment_binned], axis=1)
X=testSet.copy()
X['BR'], X['CI'], X['CL'], X['CR'],X['GE'], X['GT'],X['KE'],X['PR'], X['PT'], X['RO'], X['UY'], X['DELAYED'] =0,0,0,0,0,0,0,0,0,0,0,0
X.drop(['order_id', 'local_time', 'country_code',
'payment_status', 'n_of_products','store_address'], axis=1, inplace=True)
index_store_address_small=X[testSet.store_address.isin(store_address_small)].index
X['highTransaction']=0
X.loc[index_store_address_small, 'highTransaction'] = 1
pred = grid_search.predict(X)
pred
array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1], dtype=uint8)
testSet['final_status'] = pred
testSet['final_status'].to_csv('prediction.csv')